43 research outputs found

    CATHEDRAL: A Fast and Effective Algorithm to Predict Folds and Domain Boundaries from Multidomain Protein Structures

    We present CATHEDRAL, an iterative protocol for determining the location of previously observed protein folds in novel multidomain protein structures. CATHEDRAL builds on the features of a fast secondary-structure–based method (using graph theory) to locate known folds within a multidomain context and a residue-based, double-dynamic programming algorithm, which is used to align members of the target fold groups against the query protein structure to identify the closest relative and assign domain boundaries. To increase the fidelity of the assignments, a support vector machine is used to provide an optimal scoring scheme. Once a domain is verified, it is excised, and the search protocol is repeated in an iterative fashion until all recognisable domains have been identified. We have performed an initial benchmark of CATHEDRAL against other publicly available structure comparison methods using a consensus dataset of domains derived from the CATH and SCOP domain classifications. CATHEDRAL shows superior performance in fold recognition and alignment accuracy when compared with many equivalent methods. If a novel multidomain structure contains a known fold, CATHEDRAL will locate it in 90% of cases, with <1% false positives. For nearly 80% of assigned domains in a manually validated test set, the boundaries were correctly delineated within a tolerance of ten residues. For the remaining cases, previously classified domains were very remotely related to the query chain so that embellishments to the core of the fold caused significant differences in domain sizes and manual refinement of the boundaries was necessary. To put this performance in context, a well-established sequence method based on hidden Markov models was only able to detect 65% of domains, with 33% of the subsequent boundaries assigned within ten residues. 
Since, on average, 50% of newly determined protein structures contain more than one domain unit, and typically 90% or more of these domains are already classified in CATH, CATHEDRAL will considerably facilitate the automation of protein structure classification.

    Patchy promiscuity: machine learning applied to predict the host specificity of Salmonella enterica and Escherichia coli

    Supporting data for Patchy promiscuity: machine learning applied to predict the host specificity of Salmonella enterica and Escherichia coli, as published in Microbial Genomics.

    The advantage of intergenic regions as genomic features for machine-learning-based host attribution of Salmonella Typhimurium from the USA

    Salmonella enterica is a taxonomically diverse pathogen with over 2600 serovars associated with a wide variety of animal hosts including humans, other mammals, birds and reptiles. Some serovars are host-specific or host-restricted and cause disease in distinct host species, while others, such as serovar S. Typhimurium (STm), are generalists and have the potential to colonize a wide variety of species. However, even within generalist serovars such as STm it is becoming clear that pathovariants exist that differ in tropism and virulence. Identifying the genetic factors underlying host specificity is complex, but the availability of thousands of genome sequences and advances in machine learning have made it possible to build specific host prediction models to aid outbreak control and predict the human pathogenic potential of isolates from animals and other reservoirs. We have advanced this area by building host-association prediction models trained on a wide range of genomic features and compared them with predictions based on nearest-neighbour phylogeny. SNPs, protein variants (PVs), antimicrobial resistance (AMR) profiles and intergenic regions (IGRs) were extracted from 3883 high-quality STm assemblies collected from humans, swine, bovine and poultry in the USA, and used to construct Random Forest (RF) machine learning models. An additional 244 recent STm assemblies from farm animals were used as a test set for further validation. The models based on PVs and IGRs had the best performance in terms of predicting the host of origin of isolates and outperformed nearest-neighbour phylogenetic host prediction as well as models based on SNPs or AMR data. However, the models did not yield reliable predictions when tested with isolates that were phylogenetically distinct from the training set. The IGR and PV models were often able to differentiate human isolates in clusters where the majority of isolates were from a single animal source. 
Notably, IGRs were the best-performing feature across multiple models, which may be because IGRs both act as a representation of their flanking genes, equivalent to PVs, and capture genomic regulatory variation, such as altered promoter regions. The IGR and PV models predict that ~45 % of the human infections with STm in the USA originate from bovine, ~40 % from poultry and ~14.5 % from swine, although sequences of isolates from other sources were not used for training. In summary, the research demonstrates a significant gain in accuracy for models with IGRs and PVs as features compared to SNP-based and core genome phylogeny predictions when applied within the existing population structure. This article contains data hosted by Microreact.

    Acquisition and loss of CTX-M plasmids in Shigella species associated with MSM transmission in the UK

    Shigellosis in men who have sex with men (MSM) is caused by multidrug-resistant Shigellae, exhibiting resistance to antimicrobials including azithromycin, ciprofloxacin and, more recently, the third-generation cephalosporins. To explore AMR context, we sequenced four blaCTX-M-27-positive MSM Shigella isolates (2018–20) using Oxford Nanopore Technologies: three S. sonnei (identified as two MSM clade 2, one MSM clade 5) and one S. flexneri 3a. All S. sonnei isolates harboured Tn7/Int2 chromosomal integrons, whereas S. flexneri 3a contained the Shigella Resistance Locus. All strains harboured IncFII pKSR100-like plasmids (67–83 kbp); where present, blaCTX-M-27 was located on these plasmids flanked by IS26 and IS903B. However, blaCTX-M-27 was lost in S. flexneri 3a during storage between Illumina and Nanopore sequencing. IncFII AMR regions were mosaic and likely reorganised by IS26; three of the four plasmids contained azithromycin-resistance genes erm(B) and mph(A) and one harboured the pKSR100 integron. Additionally, all S. sonnei isolates possessed a large IncB/O/K/Z plasmid, two of which carried aph(3’)-Ib/aph(6)-Id/sul2 and tet(A). Monitoring the transmission of mobile genetic elements with co-located AMR determinants is necessary to inform empirical treatment guidance and clinical management of MSM-associated shigellosis.

    Closing gaps for performing a risk assessment on Listeria monocytogenes in ready-to-eat (RTE) foods : activity 3, the comparison of isolates from different compartments along the food chain, and from humans using whole genome sequencing (WGS) analysis

    We would like to thank all the persons and institutes that have provided the project with isolates and accompanying information. Without them, this project would not have been possible: Lin Cathrine T. Brandal, Norwegian Institute of Public Health, Norway; Julio Vázquez Moreno and Raquel Abad Torreblanca, Instituto de Salud Carlos III, Spain; Marc Lecuit, Institut Pasteur, France; Alexandre Leclercq, Institut Pasteur, France; Iva Hristova, National Center of Infectious and Parasitic Diseases, Bulgaria; Marija Trkov, National Laboratory of Health, Environment and Food, Slovenia; Cecilia Jernberg, Public Health Agency of Sweden, Sweden; Ariane Pietzka, Austrian Agency for Health and Food Safety, Austria; Eelco Franz and Ingrid Friesema, RIVM, The Netherlands; Carlo Spanu, University of Sassari, Sardinia; Ifip, French Institute for Pig and Pork Industry, Maisons-Alfort, France; and all the NRLs for providing the isolates from the EU baseline study. Special thanks to Sylvain Brisse and Alexandra Moura, Institut Pasteur, France, for providing cgMLST data. The authors would also like to thank the EFSA staff members Maria Teresa da Silva Felicio, Beatriz Guerra, Ernesto Lìebana and Valentina Rizzi, as well as the members of the Working Group on Listeria monocytogenes contamination of ready-to-eat foods, Kostas Koutsoumanis, Roland Lindqvist, Moez Sanaa, Panagiotis Skandamis, Niko Speybroek, Johanna Takkinen and Martin Wagner, for the support, revisions and suggestions during the development of the present procurement activity and report.

    The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

    We report the latest release (version 3.0) of the CATH protein domain database. There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps, which have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.

    Enteroaggregative Escherichia coli have evolved independently as distinct complexes within the E. coli population with varying ability to cause disease

    Enteroaggregative E. coli (EAEC) is an established diarrhoeagenic pathotype. The association between virulence gene content and ability to cause disease has been studied, but little is known about the population structure of EAEC and how this pathotype evolved. Analysis by multilocus sequence typing of 564 EAEC isolates from cases and controls in Bangladesh, Nigeria and the UK, spanning the past 29 years, revealed multiple successful lineages of EAEC. The population structure of EAEC indicates that some clusters are statistically associated with disease or carriage, further highlighting the heterogeneous nature of this group of organisms. Different clusters have evolved independently as a result of both mutational and recombination events; the EAEC phenotype is distributed throughout the population of E. coli.

    Recurrent seasonal outbreak of an emerging serotype of Shiga toxin-producing Escherichia coli (STEC O55:H7 Stx2a) in the south west of England, July 2014 to September 2015.

    The first documented British outbreak of Shiga toxin-producing Escherichia coli (STEC) O55:H7 began in the county of Dorset, England, in July 2014. Since then, there have been a total of 31 cases, of which 13 presented with haemolytic uraemic syndrome (HUS). The outbreak strain had Shiga toxin (Stx) subtype 2a, which is associated with an elevated risk of HUS. This strain had not previously been isolated from humans or animals in England. The only epidemiological link was living in or having close links to two areas in Dorset. Extensive investigations included testing of animals and household pets. Control measures included extended screening, iterative interviewing and exclusion of cases and high-risk contacts. Whole genome sequencing (WGS) confirmed that all the cases were infected with similar strains. A specific source could not be identified. The combination of epidemiological investigation and WGS indicated, however, that this outbreak was possibly caused by recurrent introductions from a local endemic zoonotic source, that a highly similar endemic reservoir appears to exist in the Republic of Ireland but has not been identified elsewhere, and that a subset of cases was associated with human-to-human transmission in a nursery.

    MinION nanopore sequencing identifies the position and structure of a bacterial antibiotic resistance island

    Short-read, high-throughput sequencing technology cannot identify the chromosomal position of repetitive insertion sequences that typically flank horizontally acquired genes such as bacterial virulence genes and antibiotic resistance genes. The MinION nanopore sequencer can produce long sequencing reads on a device similar in size to a USB memory stick. Here we apply a MinION sequencer to resolve the structure and chromosomal insertion site of a composite antibiotic resistance island in Salmonella Typhi Haplotype 58. Nanopore sequencing data from a single 18-h run were used to create a scaffold for an assembly generated from short-read Illumina data. Our results demonstrate the potential of the MinION device in clinical laboratories to fully characterize the epidemic spread of bacterial pathogens.